These are the goals for today’s lecture
Demonstrate to you that there’s a reproducibility crisis
Explain steps you can take to improve the reproducibility of your research
Identify the meanings of Open Access and Open Data
We’re going to spill into the workshop time a little bit today.
Have you ever heard this phrase before?
Back in 2005 a groundbreaking paper by John Ioannidis1 exposed an unsettling truth
Despite lots being written about the crisis… it’s still here. And with the rise of
“machine learning can solve anything!”
… the crisis is evolving and getting more complicated, as reported by Douglas Heaven2
Quite a lot of the time you’ll see these two words used interchangeably - but they have specific meanings. The American Statistical Association (ASA) provides useful advice from Broman et al3.
Reproducibility: A study is reproducible if you can take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study.
Reproducibility is something to think about from the start of a research project:
Plan to record and document all processes in data collection, wrangling and analysis
While performing the research keep track of all the things you do - particularly with data!
When writing up your research ensure that all necessary materials to reproduce your findings are made available.
Later in this lecture and the workshop we’ll look at specific advice for achieving this.
Replicability: This is the act of repeating an entire study, independently of the original investigator and without the use of the original data (but generally using the same methods)
This is a slightly harder topic to conceptualise and much of the lack of replicability comes from what we call “questionable research practices” (QRPs)
… these misbehaviours lie somewhere on a continuum between scientific fraud, bias, and simple carelessness, so their direct inclusion in the “falsification” category is debatable, although their negative impact on research can be dramatic
Fanelli, 20094
There are lots of different ways to summarise common QRPs but I quite like this table from 2012 by John et al5.
| Questionable Research Practice | Self-admission rate (amongst 2,000 psychologists) |
|---|---|
| In a paper, failing to report all of a study’s dependent measures | 63.4% |
| Deciding whether to collect more data after looking to see whether the results were significant | 55.9% |
| In a paper, selectively reporting studies that “worked” | 45.8% |
| Deciding whether to exclude data after looking at the impact of doing so on the results | 38.2% |
| In a paper, failing to report all of a study’s conditions | 27.7% |
| In a paper, reporting an unexpected finding as having been predicted from the start | 27.0% |
| In a paper, “rounding off” a p value (e.g., reporting that a p value of .054 is less than .05) | 22.0% |
| Stopping collecting data earlier than planned because one found the result that one had been looking for | 15.6% |
| In a paper, claiming that results are unaffected by demographic variables (e.g., gender) when one is actually unsure (or knows that they do) | 3.0% |
| Falsifying data | 0.6% |
The majority of these QRPs can be categorised as “p-hacking” - or, more fully, hacking the p-value.
We’re going to talk about p-values A LOT in week 10. For now I wanted to borrow a slide from Lucy D’Agostino McGowan’s talk which is well explained in this Twitter thread.
In some situations “p-values” are considered infallible evidence of an effect or the conclusion of a study.
There are lots of different p-value thresholds, but the most common in healthcare data science is 0.05
Researchers whose studies yield values just above 0.05 will explore ways to get the value below 0.05
That’s p-value hacking.
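None of this is on the original slides, but the “deciding whether to collect more data after looking” QRP from the table above is easy to simulate. The sketch below (plain Python, entirely illustrative - a simple z-test under a true null, not anyone’s published method) tests at alpha = 0.05 after every batch of data and stops as soon as the result looks “significant”:

```python
import math
import random

def p_value(samples):
    """Two-sided z-test p-value for mean == 0, assuming known sd of 1."""
    n = len(samples)
    z = (sum(samples) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def run_experiment(rng, peek, max_n=100, batch=10):
    """Draw data from a true null (mean 0, sd 1) and test at alpha = 0.05.

    With peek=True we test after every batch and stop as soon as
    p < 0.05 -- the 'collect more data after looking' QRP."""
    samples = []
    while len(samples) < max_n:
        samples.extend(rng.gauss(0, 1) for _ in range(batch))
        if peek and p_value(samples) < 0.05:
            return True  # declared "significant": a false positive
    return p_value(samples) < 0.05

rng = random.Random(42)
trials = 2000
peeking_rate = sum(run_experiment(rng, peek=True) for _ in range(trials)) / trials
honest_rate = sum(run_experiment(rng, peek=False) for _ in range(trials)) / trials

print(f"False-positive rate with peeking: {peeking_rate:.1%}")
print(f"False-positive rate with fixed n: {honest_rate:.1%}")
```

Even though there is no real effect, the peeking strategy declares significance far more often than the nominal 5% - which is exactly why pre-registered sample sizes and stopping rules matter.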
In 2019 Aschwanden6 published an article in Wired.com titled “We’re All ‘P-Hacking’ Now” which I highly recommend reading.
It highlights an excellent study by Simmons et al7 that was able to use p-value hacking to make two increasingly absurd conclusions:
Study 1: Listening to a children’s song (“Hot Potato” by The Wiggles) makes people feel older.
Study 2: Listening to a song about old age (“When I’m Sixty-Four” by The Beatles) makes people actually younger.
… I recommend this paper because it makes very clear recommendations to researchers and reviewers.
Always expect to read a paper multiple times and make notes.
We can kind of neatly split papers into two different types:
Clinical trials like this one from Hu et al8.
In some clinical trials the abstract includes a lot of structured information - it depends on the publisher.
Some journals like BMJ Open even include study strengths and weaknesses.
Read these papers in this order:
Abstract
Tables and/or figures
Conclusions
Introduction
But most papers look like this one from Simmons et al7.
Unlike in some medical journals, the abstract is unstructured and might not contain much quantitative information.
But the reading order for the paper is the same:
Abstract
Tables and/or figures
Conclusions
Introduction
On your first pass of a paper you are trying to understand if the paper is relevant and provides substantial information and/or evidence for your needs.
Often it will take another 1 or 2 passes to understand the results of the paper.
It usually takes even more effort to understand the methods of the paper
Methodology information is often provided in “supplementary materials”
But unfortunately a good portion of the time you won’t find sufficient information to fully understand the methodology because of poor reproducibility.
Learning about replicability and reproducibility now will help you in understanding the existing literature and prepare you to succeed in a research career later on.
If you decide to go into a research career you’ll likely be reading 10+ papers a week.
I’d highly recommend investing in learning speed reading early in your career.
There’s lots of very interesting eye-tracking and neurological research into how we read, and evidenced methodologies for speed reading, nicely summarised by Clifton et al9.
BCU gives you free access to LinkedIn Learning.
Go to linkedin.com/learning-login/ and login with your BCU email address.
So far I’ve been speaking about the academic literature at large.
But let’s look specifically at clinical trials.
All clinical trials that began after 1st July 2005 are explicitly required to be registered1 in order to be published in any biomedical journal overseen by the International Committee of Medical Journal Editors (ICMJE)10.
This is independent of the country in which the trial took place.
[1] Frustratingly in the clinical trials community we use the phrase “registration” but everyone else says “pre-registration”.
From a cursory search for clinical trial registration information you might come to the conclusion that the advice is only appropriate for studies in the US.
That is not true.
In 2008 the World Medical Association11 updated the Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Subjects to include a paragraph about registration.
19. Every clinical trial must be registered in a publicly accessible database before recruitment of the first subject.
The NIH provides a really useful tool for comparing the clinical research regulations from several countries:
Despite these “requirements”… violations are still common as detailed in Bradley et al12.
However, the UK has announced new infrastructure requiring 100% clinical trial registration via a collaboration between the Human Research Authority and the ISRCTN registry. See Bruckner13 for a thorough overview of what’s changing.
A lot has been written about registration of trials “reducing research waste”, as it helps reduce duplication of studies.
While elements of this are true… the big takeaway is pre-registration helps prevent Questionable Research Practices.
In fact it’s the reason why GSK reached a $2.5 million settlement in 2004.
GSK chose to settle a civil case instead of engaging in an expensive legal battle over “repeated and persistent fraud” concerning the use of paroxetine in treating depression in adolescents.
As detailed in 2004 by Dyer14:
Two studies showed no benefit to using paroxetine when compared with placebo.
Three studies found evidence for an increase in suicidal thoughts and behaviour.
Internal company documents confirmed the suppression of these results:
“it would be commercially unacceptable to include a statement that efficacy had not been demonstrated, as this would undermine the profile of paroxetine.”
Spurgeon 200415
There’s a growing body of researchers both actually doing pre-registration and calling for it in all disciplines - particularly those that intersect with healthcare.
Simmons et al7 have been running aspredicted.org since 2015 to help authors to create pre-registration reports.
The Centre for Open Science encourages pre-registration on OSF and has in the past run a Preregistration Challenge with a monetary prize.
I highly recommend reading Nosek et al16, which gives a great overview of pre-registration and walks through multiple examples of how it works in practice.
At the moment there’s nothing binding you to pre-register studies that aren’t clinical trials, but this topic is simmering away in the background
In prediction markets, investors make predictions of future events by buying shares in the outcome of the event and the market price indicates what the crowd thinks the probability of the event is.
This was first applied to predicting replicability of research results in 2015 by Dreber et al17.
There is now evidence that these markets provide reliable estimates of replicability18.
Prof. Anna Dreber gives an excellent 40-minute overview of replication prediction markets here - https://youtu.be/a5rFDKB1aZc?t=1036
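As a rough illustration of how a market price encodes a probability (my own sketch, not from Dreber et al): if a share pays out $1 when a study replicates and $0 when it doesn’t, then the implied probability is just the price divided by the payout.

```python
def implied_probability(price: float, payout: float = 1.0) -> float:
    """A share pays `payout` if the study replicates, 0 otherwise.
    The market price divided by the payout is the crowd's implied
    probability of replication."""
    return price / payout

# A share trading at $0.70 implies the crowd thinks there is
# roughly a 70% chance the study will replicate.
print(implied_probability(0.70))
```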
Most (if not all) studies about replicability are deeply technical and rely on statistical methods we don’t have time to cover in this course.
This includes the foundational paper “Why Most Published Research Findings Are False” by Ioannidis1.
Your assessment does not require you to understand or replicate any of the methodologies behind replicability studies.
Let’s get back to the ASA reproducibility recommendations from Broman et al3.
Reproducibility: A study is reproducible if you can take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study.
In order for reproducibility to be possible we need the original data to be accessible
We need Open Data.
More often than not, papers simply do not provide the data that they:
Use to create charts and tables
Use to perform statistical tests
Use to generate their conclusions
These papers are the antithesis of reproducible.
It’s common to see the phrase “data available on request” but that’s frequently meaningless:
“Data requests to authors are successful in 27–59% of cases, whereas the request is ignored in 14–41% cases”
Tedersoo et al 202119
Even in cases where data is returned, it’s often insufficient for reproducibility, as Roche et al20 found:
“Data” might actually be screenshots of charts stored in an Excel workbook
“Data” might be stored in other “non-machine-readable” formats like PDF or images
“Data” might only be provided in its post-analysis form
In one of the earliest and most cited studies of data sharing, Tenopir et al21 (2011) surveyed 1,329 scientists and found:
“Most respondents (at least 60% across disciplines) agree that lack of access to data generated by other researchers or institutions is a major impediment to progress in science.”
… but
“A majority of all respondents indicate they are not willing to place all of their data in central repositories with no restrictions”
These findings have been replicated again, and again.
For researchers:
Open Data helps reproduce previous studies
Open Data means researchers can do new studies, including meta analyses.
Open Data gives an additional way researchers can be cited
There is clear evidence Open Data is linked with higher citation rates, eg Colavizza et al 202022.
You can find some healthcare specific examples on the course website eng7218.netlify.app/resources/open-data. Please do research your own - and consider sharing them with the group.
Big Data is great. It’s the driver behind the Internet of Things and much of modern healthcare technology.
But in most circumstances it is not Open Data.
Screenshot of opendefinition.org26
Let’s distil these definitions:
Open data must be:
Legally open. The data must be subject to an open data license
Technically open.
Data files must be machine-readable and non-proprietary, which often means plain text.
Accessible from a public server without password protection
Or even more succinctly:
Open data must be open to humans and computers.
Please note this definition of open data for your assessment.
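As a tiny illustration of the “technically open” half of that definition (the dataset and field names below are made up): plain-text CSV round-trips through any language’s standard tooling, which is exactly what a screenshot of a chart or a table locked in a PDF cannot do.

```python
import csv
import io

# A tiny, hypothetical dataset in a machine-readable, non-proprietary
# form: plain-text CSV with a header row and one observation per row.
rows = [
    {"participant_id": "P001", "age_group": "18-24", "score": "42"},
    {"participant_id": "P002", "age_group": "25-34", "score": "37"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["participant_id", "age_group", "score"])
writer.writeheader()
writer.writerows(rows)

# Because the format is plain text, any tool can parse it back out --
# humans can read it, and so can computers.
parsed = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(parsed[0]["score"])
```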
In most cases the Creative Commons’ “Choose a license tool” is the best and easiest choice if you have a dataset you want to make into Open Data.
There are special Open Data licenses used by Governmental/Charity organisations designed to waive liability for use, eg the Open Government License from the UK Government27
In general it’s best to use data licenses for data and software licenses for software.
There are SO MANY different sources (or publishers) of Open Data, for a good sample check out the Open Data Essentials page from the World Bank28.
Open Data is awesome. But if we’re responsible for data that can identify individuals or groups we have a [legal] duty of care to protect that data.
In the UK we have the Data Protection Act29 which is the UK’s implementation of the General Data Protection Regulation (GDPR).
In week 6 we will discuss GDPR and the Data Protection Act in the context of anonymising data.
This isn’t a course in the law school so we won’t go hard into the definitions. But there are some things we need to discuss.
In the DPA29 there are 6 different types of sensitive data defined in section 86
I’m highlighting the ones that cover data that you might reasonably collect and consider “health data”.
(a) the processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs or trade union membership;
(b) the processing of genetic data for the purpose of uniquely identifying an individual;
(c) the processing of biometric data for the purpose of uniquely identifying an individual;
(d) the processing of data concerning health;
(e) the processing of data concerning an individual’s sex life or sexual orientation;
(f) the processing of personal data as to [commission or alleged commission of an offence]
There’s some recursion, so let’s pull out the definitions of “health data” from Part 7 Section 205 and list everything together
Part 7 Section 205 defines the following
“biometric data” means personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of an individual, which allows or confirms the unique identification of that individual, such as facial images or dactyloscopic data;
“data concerning health” means personal data relating to the physical or mental health of an individual, including the provision of health care services, which reveals information about his or her health status;
“genetic data” means personal data relating to the inherited or acquired genetic characteristics of an individual which gives unique information about the physiology or the health of that individual and which results, in particular, from an analysis of a biological sample from the individual in question;
But we should also still include these sections from section 86:
(a) the processing of personal data revealing racial or ethnic origin,
(e) the processing of data concerning an individual’s sex life or sexual orientation;
There’s more reader friendly documentation about health data and DPA 2018 from the Information Commissioner’s Office30.
We’ll talk about this more later.
UKRI is responsible for the 7 UK research councils who fund most university-based research in the UK.
Some research councils have their own “data sharing policies”, but others depend on the “common principles on research data”31:
Publicly funded research data are a public good and produced in the public interest. They should be made openly available with as few restrictions as possible in a timely and responsible manner.
Arts and Humanities Research Council (AHRC32): Relies on the UKRI common principles.
Biotechnology and Biological Sciences Research Council (BBSRC33): “BBSRC expects that all data (with accompanying metadata) should be shared in a timely fashion as soon as it is verified”.
Engineering and Physical Sciences Research Council (EPSRC34): Has the most explicit data policy, including a requirement for DOI.
Economic and Social Research Council (ESRC35): Explicit requirement that “data will be made available […] as Open Data”.
Medical Research Council (MRC): There’s an “expectation” from MRC36 that data must be made open, they helpfully provide lots of advice about patient and population data.
Natural Environment Research Council (NERC37): Flubs it by saying all research “must include a statement on how the supporting data and any other relevant research materials can be accessed”
Science and Technology Facilities Council (STFC38): “STFC expects that published data should be made publicly available within six months of publication unless justified otherwise”
1. Submit a Data Management Plan (DMP) to a funder, including a data sharing plan
2. Pre-register your research (often simultaneously with step 1)
3. Reserve a DOI to store your research data at a data repository
4. Do the research
Keep raw data pristine. Do not modify your raw data.
Keep track of how you wrangle the data (much easier when you write R code!)
Craft an anonymised, shareable version of your datasets described in your DMP.
5. Write up the study
6. Choose a journal to submit to
That’s an entire process in and of itself!
7. Make the data deposit public when your research is made public
You might want to create follow up studies or new studies where you add to the existing data deposit.
This works nicely thanks to DOI versioning!
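Step 4’s advice - keep raw data pristine and script your wrangling - can be sketched like this. The file layout, filenames and sentinel value are all hypothetical, and while this course uses R for wrangling, the same pattern applies in any language:

```python
import csv
import pathlib
import tempfile

# Hypothetical project layout: raw data is read but never written to.
workdir = pathlib.Path(tempfile.mkdtemp())
raw_dir = workdir / "raw"
derived_dir = workdir / "derived"
raw_dir.mkdir()
derived_dir.mkdir()

# Pretend this file arrived straight from data collection.
raw_file = raw_dir / "measurements.csv"
raw_file.write_text("id,value\n1,10\n2,-999\n3,30\n")

# The wrangling step, written as code so it is recorded and repeatable:
# drop the sentinel value -999 and write the result into derived/.
with raw_file.open() as f:
    rows = [r for r in csv.DictReader(f) if r["value"] != "-999"]

clean_file = derived_dir / "measurements_clean.csv"
with clean_file.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(rows)

# The raw file is untouched; the script documents exactly what changed.
print(sorted(p.name for p in workdir.rglob("*.csv")))
```

Because the raw file is never modified, anyone (including future you) can re-run the script and reproduce the derived dataset exactly.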
I tweeted asking for advice about things I missed
Hey folks! What did I miss out about the journey of an Open Dataset through the entire research process? (Note: this is customised a bit for the UK.) I’m also very happy for ideas about adding complications or other notes 😀
— Charlotte (@charliejhadley) August 19, 2022
Metadata records, including data dictionaries, are really useful tools for studies. These can be deposited into data repositories and linked to in your articles.
I haven’t mentioned ethics forms and consent forms. These are very important. Refer to your Research Support teams for help in designing these documents.
So far I’ve only spoken about the data element of reproducibility, we’ll get onto the code in the workshop.
DOI are extremely important to ensure research availability into the future.
Academic journal links are fragile and could change at any time:
DOI are persistent, long-term identifiers that look like this:
10.1016/j.crfs.2022.05.015
The publisher and the DOI Foundation are then responsible for directing you to the resource by constructing a URL like this:
doi.org/10.1016/j.crfs.2022.05.015
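That resolution step is mechanical enough to sketch in a couple of lines (a minimal illustration of how the resolver URL is built, not an official DOI Foundation tool):

```python
def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI by prefixing the doi.org proxy.
    The DOI Foundation's resolver then redirects to wherever the
    publisher currently hosts the object."""
    return f"https://doi.org/{doi}"

print(doi_to_url("10.1016/j.crfs.2022.05.015"))
```

This is why DOI survive publisher website redesigns: the identifier stays fixed while the redirect target can change.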
Initially DOI were only issued by academic publishers to resolve journal articles.
Data Repositories started issuing DOI so we could resolve links to data, code and more.
Specialist Data Repositories
Sometimes you need a repository with specialist features, eg:
Genome sequences
Protein sequences
Climate data
Nature Publishing Group39 provides an excellent overview of these tools.
General Purpose Repositories
These tools all have slightly different advantages and disadvantages
Figshare
Zenodo
Open Science Framework
DOI are great for resolving our research outputs, what about uniquely identifying researchers?
Researchers often share the same names or change their names throughout their career.
The only open researcher identifier is ORCID.
Thankfully - it also works really well!
It keeps track of all publications and deposits on data repositories.
Here’s mine: orcid.org/0000-0002-3039-6849.
I want you to register for an ORCID now so you can use it for everything - including in your CV
Figshare.org
This is my favourite repository.
It has a unique DOI versioning system
We can always get the most recent version of an object: doi.org/10.6084/m9.figshare.3761562
Or a specific version: doi.org/10.6084/m9.figshare.3761562.v202
Zenodo.org
Zenodo’s infrastructure hangs off the back of CERN. If we’re being pessimistic, this makes it the most reliable long-term option - the others might fail.
DOI versioning is semantically linked - it takes a little bit more effort to obtain the most recent version of an object
Open Science Framework - osf.io
The OSF tries to solve many problems at once, instead of just being a data repository.
OSF does not support DOI versioning.
The OSF interface is quite complex.
Everything we’ve spoken about has been very theoretical. I want you to go through the steps of creating a collection on Figshare.
We’re creating a Collection because it can contain multiple Figshare items. At the beginning of your research you likely don’t know exactly how many data files you’ll end up with.
Sign up for Figshare with your ORCID
Go to “My data” in Figshare
Go to “Collections”
Create a Collection
Reserve a DOI
Go to “My data”
Add a new item
Reserve a DOI
Go to your collection and add this data item.
Figshare is very much a general purpose repository.
If you create something you want to make available for the future the best thing you can do is get a DOI.
Consider using Figshare for presentations
Consider using Figshare for posters
It used to be extremely hard to access the research that UKRI funds - despite it being funded by public money.
The Open Access movement began in the 90s and is ever growing.
It’s now a requirement of UKRI31 funding that
the final Version of Record or the Author’s Accepted Manuscript must be free to view and download via an online publication platform, publishers’ website, or institutional or subject repository within a maximum of 12 months of publication
… however, this often means that someone is paying an Article Processing Charge (APC).
APCs can be split between publishers and authors in different ways.
UKRI is usually responsible for paying the author’s portion of the bill.
Gold Open Access: All articles in a journal are Open Access.
Hybrid Open Access: Specific articles in a journal are made Open Access through APCs. Journals receive money through both subscriptions and APCs.
No APC is paid
The most important category here is Green Open Access.
In Green Open Access the author self-archives their article in a publicly available repository.
This gets complicated.
Some publishers require that only pre-prints are self-archived
Some publishers allow post-print publishing.
If you’re interested read Gadd and Troll Covey40.
There is a significant and very clear bias to researchers publishing “positive results”41 - which you can even see in article titles.
This poses significant issues in the literature.
It’s really useful to know X doesn’t work! It means others don’t need to repeat the result.
Negative effects can be under-reported, as per the GSK lawsuit in 2004.
A lack of negative results introduces bias to meta-analyses.
The MRC Open Research Data policy36 explicitly requires both positive and negative results of studies be published within 24 months of the trial end.
This is an open problem.
There was a push for new negative result journals in the early 2010s but several of these folded, including the Journal of Negative Results in Biomedicine. The remaining journals have very low impact factors, as they are published by smaller publishers.
In general the journals with the highest impact factors are getting better at publishing negative results.
When did you as a student/academic researcher first learn about reproducible research methods? #reproducibleresearch #reproducibility
— Charlotte (@charliejhadley) August 16, 2022
(More polls on this below, sorry they don’t have a “show results option”)